A toponym-based dual vector for topical relevance calculation in focused spatial crawling
Abstract
A focused crawler is a Web crawler that tries to download only pages relevant to a given topic of interest (Siemiński 2009, Almpanidis 2011). A focused crawler must therefore calculate the relevance between pages and the specific topic (Rungsawang, 2005). Recently, topics involving spatial information, especially toponyms, such as the Diaoyu Islands conflict between China and Japan, have become much more common, because about 18 percent of webpages worldwide describe localization information and 70 percent contain geographic information (Zhou et al. 2005, Hill 2009). For a given topic involving toponyms, the focused crawler should ensure that the downloaded pages are relevant not only to the common topic but also to the toponyms, which requires more accurate topical relevance calculation. At present, most researchers adopt the Vector Space Model (VSM) (Fox 1984, Batsakis et al. 2009) or an improved VSM (Yang et al. 2010, Li 2008) to calculate topical relevance. In these methods the given topic and the document are each represented as a single keyword vector, which means that toponyms are treated as common topic keywords. Treating toponyms this way may reduce crawling accuracy, because a toponym can involve different common topics and a common topic can relate to many toponyms. To solve this problem, we distinguish toponyms from common keywords, following the idea of Geographic Information Retrieval (Purves 2007), and develop a toponym-based dual vector for topical relevance calculation.
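As a rough illustration of the dual-vector idea, the sketch below splits a token list into a common-keyword vector and a toponym vector, scores each pair with VSM cosine similarity, and combines the two scores. The abstract does not give the authors' actual formula, so the weighted sum, the `alpha` parameter, the set-based `gazetteer` lookup, and all function names are assumptions made for this example only.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_a * norm_b)


def split_terms(tokens, gazetteer):
    """Split tokens into (common-keyword vector, toponym vector).

    `gazetteer` is a hypothetical set of known place names; real
    toponym recognition is considerably more involved.
    """
    keywords = Counter(t for t in tokens if t not in gazetteer)
    toponyms = Counter(t for t in tokens if t in gazetteer)
    return keywords, toponyms


def dual_vector_relevance(topic_tokens, page_tokens, gazetteer, alpha=0.5):
    """Combine keyword and toponym similarity.

    The weighted sum with `alpha` is an assumed combination rule,
    not the formula from the paper.
    """
    topic_kw, topic_tp = split_terms(topic_tokens, gazetteer)
    page_kw, page_tp = split_terms(page_tokens, gazetteer)
    kw_score = cosine(topic_kw, page_kw)
    tp_score = cosine(topic_tp, page_tp)
    return alpha * kw_score + (1 - alpha) * tp_score


# Illustrative data only
GAZETTEER = {"diaoyu", "china", "japan", "paris"}
topic = ["diaoyu", "conflict", "china", "japan"]
page_same_place = ["conflict", "diaoyu", "japan", "china"]
page_other_place = ["conflict", "paris"]

score_same = dual_vector_relevance(topic, page_same_place, GAZETTEER)
score_other = dual_vector_relevance(topic, page_other_place, GAZETTEER)
```

With these inputs, the page that shares the topic's keyword but names a different place scores lower than the page that matches both vectors, which is the kind of separation a single keyword vector does not make explicit.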
Similar resources
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not a simple task to download domain-specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed, among them a key technique is focused crawling which is able to crawl particular topical...
Intelligent Event Focused Crawling
There is a need for an integrated event focused crawling system to collect Web data about key events. When an event occurs, many users try to locate the most up-to-date information about that event. Yet, there is little systematic collecting and archiving anywhere of information about events. We propose intelligent event focused crawling for automatic event tracking and archiving, as well as effec...
Topical web crawling for domain-specific resource discovery enhanced by selectively using link-context
To enable topical web crawling, link-context is the critical contextual information of anchor text for retrieving domain-specific resources. While some link-contexts may misguide topical web crawling and extract wrong web pages, because several relevant anchor texts become irrelevant or several irrelevant anchor texts become relevant after calculating the relevance between the link-contexts and...
Collecte orientée sur le Web pour la recherche d'information spécialisée. (Focused document gathering on the Web for domain-specific information retrieval)
Focused document gathering on the Web for domain-specific information retrieval. Vertical search engines, which focus on a specific segment of the Web, become more and more present in the Internet landscape. Topical search engines, notably, can obtain a significant performance boost by limiting their index on a specific topic. By doing so, language ambiguities are reduced, and both the algorithm...
On-line topical importance estimation: an effective focused crawling algorithm combining link and content analysis
Focused crawling is an important technique for topical resource discovery on the Web. The key issue in focused crawling is to prioritize uncrawled uniform resource locators (URLs) in the frontier to focus the crawling on relevant pages. Traditional focused crawlers mainly rely on content analysis. Link-based techniques are not effectively exploited despite their usefulness. In this paper, we pr...